DOI: 10.1201/9781003355205-3
Chapter 3
De Novo Genome Assembly
3.1 INTRODUCTION TO DE NOVO GENOME ASSEMBLY
In the previous chapter, we discussed mapping reads to the reference genomes of
organisms that have available reference sequences, and we also discussed
reference-based genome assembly. But what if no reference genome is available for an
organism, or we need to sequence the genome of an unknown organism? In this case, aligning
reads to a reference genome is not possible, and we must assume no prior knowledge of
the genome of that organism, its length, or its composition. This is where de novo genome
assembly comes into play. It can also be used for variant discovery in a species with a
solved reference genome when we need to avoid any bias created by prior knowledge. De novo
genome assembly is a strategy for assembling a novel genome from scratch without the aid of
a reference genome sequence. Because of improvements in the cost and quality of DNA
sequencing, de novo genome assembly is now widely used, especially in metagenomics for
bacterial and viral genome assembly from environmental and clinical samples.
De novo genome assembly aims to join reads into a contiguous sequence called
a contig. Multiple contigs are joined together to form a scaffold, and multiple scaffolds
can in turn be linked to form a chromosome. The genome assembly is made of consensus
sequences. Single-end, paired-end, and mate-pair reads can all be used in de novo
assembly, but paired reads are preferred because they provide high-quality alignments
across DNA regions containing repetitive sequences and produce longer contigs by
filling gaps in the consensus sequence. Assembling an entire genome is usually challenging
because of the presence of numerous long stretches of tandem repeats in the genome. These
repeats create gaps in the assembly. The gap problem can be addressed by deep sequencing,
that is, sequencing a genome multiple times to provide sufficient coverage and sequencing
depth, which increases the chance of read overlaps. The sequencing coverage is defined as
the average number of reads that align to or cover known reference bases of the genome,
and it is estimated as follows [1]:
\[
\text{coverage} = \frac{\text{read length} \times \text{number of reads}}{\text{haploid genome length (bp)}}
\tag{3.1}
\]
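As a quick illustration of Equation 3.1, the following minimal Python sketch computes the expected coverage and, by rearranging the same equation, the number of reads needed to reach a target coverage. The function names and the example numbers (a 5 Mbp genome, 150 bp reads) are hypothetical and chosen only for illustration. The sketch assumes that all reads have the same length and that the number of reads counts individual reads, so a paired-end run with N pairs contributes 2N reads.

```python
import math


def estimate_coverage(read_length_bp, number_of_reads, haploid_genome_length_bp):
    """Expected coverage from Equation 3.1 (assumes uniform read length)."""
    if haploid_genome_length_bp <= 0:
        raise ValueError("haploid genome length must be positive")
    return read_length_bp * number_of_reads / haploid_genome_length_bp


def reads_needed(target_coverage, read_length_bp, haploid_genome_length_bp):
    """Equation 3.1 rearranged for the number of reads, rounded up."""
    if read_length_bp <= 0:
        raise ValueError("read length must be positive")
    return math.ceil(target_coverage * haploid_genome_length_bp / read_length_bp)


if __name__ == "__main__":
    # Hypothetical example: 1,000,000 reads of 150 bp from a 5 Mbp genome.
    coverage = estimate_coverage(150, 1_000_000, 5_000_000)
    print(f"Estimated coverage: {coverage:.1f}x")  # Estimated coverage: 30.0x

    # Reads required to reach 30x coverage of the same genome.
    print(f"Reads needed for 30x: {reads_needed(30, 150, 5_000_000)}")
```

Note that this is an average over the whole genome. Actual per-base depth varies along the genome, so regions such as repeats may still be poorly covered even when the estimated coverage looks sufficient.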